Using purrr

ESM 244 2024

Nathaniel Grimes

Bren School of Environmental Science

2024-02-12

Packages to follow along

library(tidyverse)
library(knitr)
library(kableExtra)
library(tictoc)
library(furrr)

What are lists?

Extract information using [] and [[]]

example <- list("apple",
                data.frame(x = seq(1, 5), y = seq(11, 15)),
                lm(mpg ~ cyl, data = mtcars))

print(example)
[[1]]
[1] "apple"

[[2]]
  x  y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15

[[3]]

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  
mod <- example[[3]]

mod

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  
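The two bracket styles behave differently: `[` keeps the list wrapper, while `[[` drops it and returns the element itself. A quick sketch with a small list like the one above:

```r
# A small list mixing types, like the example above
example <- list("apple", data.frame(x = 1:5, y = 11:15))

example[1]    # single brackets: a list of length 1 containing "apple"
example[[1]]  # double brackets: the element itself, the string "apple"

class(example[1])    # "list"
class(example[[1]])  # "character"
```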

purrr has useful functions to help us deal with lists

ggplots and all stats models in R are stored as lists

Data science relies on computers' power to iterate

Computers can do thousands to millions of tasks, really fast

For loops are useful tools, but have limitations

  • Don’t integrate into the tidyverse very well

  • Difficult to interpret

  • Nesting is confusing

  • One error breaks the whole process

  • Can be slow if not constructed properly

R is a functional language (i.e., we use functions to get stuff done), so why not use a functional iterative process?

Introducing mapping with purrr

map(.x,  # What are we evaluating or passing through to a function?
    .f,  # The function itself
    ...  # Extra features for the function or mapping options
    )

Because the first argument is the data, map works really well in pipes

Quickly apply functions to all columns of data

mtcars %>% 
  map(mean)%>% 
  kable() %>% 
  kable_classic()
mpg     20.09062
cyl      6.1875
disp   230.7219
hp     146.6875
drat     3.596563
wt       3.21725
qsec    17.84875
vs       0.4375
am       0.40625
gear     3.6875
carb     2.8125

Map returns the results as a list

  • You’ve encountered lists before. Regression outputs, ggplots, etc. are all stored as lists in R. They are the most flexible storage object followed by tibbles.

Check out the cheatsheet for all the information
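When you want a plain vector instead of a list, purrr's typed variants (`map_dbl()`, `map_chr()`, `map_lgl()`, etc.) return one directly; a sketch using the same column means:

```r
library(purrr)

# map() returns a list; map_dbl() returns a named numeric vector
means_list <- map(mtcars, mean)
means_vec  <- map_dbl(mtcars, mean)

means_vec[["mpg"]]  # 20.09062
```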

map streamlines workflow

Imagine you are tasked with running regressions over subsets of data, each with a different regression specification. How would you do it?

mod1 <- lm(y ~ x, data = filter(df, group == "thing1"))
mod2 <- lm(y ~ x, data = filter(df, group == "thing2"))
....

Map can apply regressions to any number of subsets

mtcars %>% 
  split(mtcars$cyl) %>%   # split() is base R, not part of the tidyverse
  map(~lm(mpg ~ wt, data = .)) # the . is the .x passed in from the pipe by map

The dataset only has 3 cylinder groups (4, 6, and 8), but there could have been 1,000, and the code above would store 1,000 regression models
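Because each fitted model is itself a list, purrr can also pull pieces back out of the results; a sketch extracting the `wt` slope from every model with `coef()` and `map_dbl()`:

```r
library(tidyverse)

# Fit one regression per cylinder group
mods <- mtcars %>%
  split(mtcars$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

# coef() returns each model's coefficient vector; map_dbl() flattens to a named vector
slopes <- map_dbl(mods, ~ coef(.x)[["wt"]])
slopes  # one slope per cylinder group, all negative (heavier cars get worse mpg)
```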

Map over multiple lists with map2 and pmap

map2(.x,  # dataset 1
     .y,  # dataset 2
     .f,  # the function that accepts dataset 1 and dataset 2
     ...
     )


by_cyl <- mtcars %>% split(mtcars$cyl)  # Store the data for predictions

mods <- by_cyl %>% map(~lm(mpg ~ wt, data = .))

predictions <- map2(mods, by_cyl, predict) # Take my linear mods and use the data to predict mpg

pmap lets us put in any number of inputs as a list
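A minimal sketch of `pmap()`: the inputs go in a single list, and when the list's names match the function's argument names they are matched by name (here the base R `rnorm(n, mean, sd)`):

```r
library(purrr)

# One list of inputs; names match rnorm()'s arguments
args <- list(n = c(3, 5), mean = c(0, 10), sd = c(1, 5))

set.seed(42)
samples <- pmap(args, rnorm)  # draws 3 values from N(0,1), then 5 from N(10,25)

lengths(samples)  # 3 5
```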

map requires careful thought

  1. What am I trying to accomplish?

  2. How is my data currently stored?

  3. How will my data be passed to map?

  4. Build (or use) a function that accepts everything

I’ve used map to iterate bioeconomic models over 100,000s of parameter combinations and to run machine learning algorithms on 1,000s of datasets

Beyond this class you will encounter big datasets; purrr is a great way to handle enormous data.

Extra Fun Things

map can catch errors

  1. Won’t break the entire process, forcing you to start again. Nothing hurts more than running code for hours only to get an error at the end.

  2. You can see afterward which models ran into errors

safely() captures errors alongside results, while possibly() lets you return a default value when an error occurs

# Make up some data
dat = structure(list(group = c("a", "a", "a", "a", "a", "a", "b", "b", "b"), 
                     x = c("A", "A", "A", "B", "B", "B", "A", "A", "A"), 
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)), 
                class = "data.frame", row.names = c(NA, -9L))

# Define a safe lm function
safelm <- safely(.f = lm)

dat %>% 
  split(dat$group) %>%
  map(~safelm(y ~ x, data = .x)) %>% 
  map("error") # Pull out errors
$a
NULL

$b
<simpleError in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels>
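possibly() handles the same failure differently: instead of storing the error, it swaps in a default value of your choosing. A sketch on the same made-up data (group b has only one level of x, so lm() errors):

```r
library(tidyverse)

# Same made-up data as above
dat <- data.frame(group = c("a","a","a","a","a","a","b","b","b"),
                  x = c("A","A","A","B","B","B","A","A","A"),
                  y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3))

# possibly() returns `otherwise` instead of throwing an error
poss_lm <- possibly(lm, otherwise = NULL)

mods <- dat %>%
  split(dat$group) %>%
  map(~ poss_lm(y ~ x, data = .x))

map_lgl(mods, is.null)  # a: FALSE (model fit), b: TRUE (error replaced by NULL)
```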

Sets up parallel processing!

The furrr package allows you to access your computer or server’s multiple cores

Can speed things up dramatically, at best in proportion to the number of cores

Works just like purrr’s map, but sends blocks of data to your computer’s other cores

library(tictoc)
library(furrr)

plan(sequential)

# Pause the computer for two seconds, three times in sequence
tic()
test_slow <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
toc()
6.15 sec elapsed
# Tell the computer we want to use three of our cores
plan(multisession, workers = 3)

# Start a timer, then run the same three two-second pauses in parallel
tic()
test_fast <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
toc()
3.94 sec elapsed

Progress bars

Tired of waiting for code to run with no idea how long it will take?

purrr comes with built-in progress bars so you can see how long it’s taking

library(purrr)
x <- map(1:50, \(x) Sys.sleep(0.1),
         .progress = TRUE)
# Notice I used an anonymous function
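The `\(x)` shorthand is base R's anonymous-function syntax (available since R 4.1) and is interchangeable with the `~ .x` formula style used earlier in these slides:

```r
library(purrr)

# Three equivalent ways to write the function passed to map
map_dbl(1:3, function(x) x^2)  # classic anonymous function
map_dbl(1:3, ~ .x^2)           # purrr formula shorthand
map_dbl(1:3, \(x) x^2)         # base R lambda shorthand (R >= 4.1)
# all three return 1 4 9
```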